An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases
Abstract
This paper discusses effective processing of similarity search that supports time warping in large sequence databases. Time warping enables finding sequences with similar patterns even when they are of different lengths. Previous methods for processing similarity search that supports time warping cannot employ multi-dimensional indexes without false dismissal, since the time warping distance does not satisfy the triangular inequality. They have to scan the entire database and thus suffer from serious performance degradation in large databases. Another method, which uses a suffix tree and does not assume any particular distance function, also performs poorly because of the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to improve search performance in large databases without permitting any false dismissal. To attain this goal, we devise a new distance function D_tw-lb that consistently underestimates the time warping distance and also satisfies the triangular inequality. D_tw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes and D_tw-lb as a distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we perform extensive experiments. The results reveal that our method achieves a significant speedup of up to 43 times with real-world S&P 500 stock data and up to 720 times with very large synthetic data. The performance gain becomes larger (1) as the number of data sequences gets larger, (2) as the average length of data sequences gets longer, and (3) as the tolerance in a query gets smaller. Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.
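The filter-and-refine idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which four features form the 4-tuple, so the sketch assumes the first, last, greatest, and smallest elements of a sequence, and takes D_tw-lb to be the largest absolute difference between corresponding features. It also assumes a time warping distance that sums absolute element differences along the warping path, and it uses a plain linear scan over the feature vectors in place of a multi-dimensional index. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Feature = Tuple[float, float, float, float]


def features(seq: Sequence[float]) -> Feature:
    """Assumed 4-tuple: (first, last, greatest, smallest) element."""
    return (seq[0], seq[-1], max(seq), min(seq))


def d_tw_lb(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed lower bound: largest absolute feature difference."""
    return max(abs(x - y) for x, y in zip(features(a), features(b)))


def d_tw(a: Sequence[float], b: Sequence[float]) -> float:
    """Time warping distance by dynamic programming.

    Assumption: the cost of a warping path is the sum of |a_i - b_j|
    over all matched pairs. Both sequences must be non-empty.
    """
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # empty prefix matched against empty prefix
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cur[j] = abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def range_search(
    database: List[Sequence[float]], query: Sequence[float], eps: float
) -> List[int]:
    """Return indexes of sequences whose time warping distance is <= eps.

    Filter step: discard sequences whose lower bound already exceeds eps.
    Refine step: compute the exact distance only for the survivors.
    A real system would run the filter step through a multi-dimensional
    index; a linear scan is used here to keep the sketch short.
    """
    candidates = [i for i, s in enumerate(database) if d_tw_lb(s, query) <= eps]
    return [i for i in candidates if d_tw(database[i], query) <= eps]
```

Under these assumptions the filter step cannot discard a true answer. Every warping path matches the first elements with each other and the last elements with each other, and the greatest (or smallest) element of one sequence must be matched to some element of the other that is no greater (or no smaller) than that sequence's own extreme, so each feature difference is at most the full path cost.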
Similar resources
A heuristic approach for multi-stage sequence-dependent group scheduling problems
We present several heuristic algorithms based on tabu search for solving the multi-stage sequence-dependent group scheduling (SDGS) problem by considering minimization of makespan as the criterion. As the problem is recognized to be strongly NP-hard, several meta (tabu) search-based solution algorithms are developed to efficiently solve industry-size problem instances. Also, two different initi...
Automatic evaluation of Persian web video search engines based on vote aggregation
Today, the growth of the internet and its strong influence on individuals' lives have led many users to meet their daily needs through search engines; hence, search engines need to be modified and continuously improved. Therefore, evaluating search engines to determine their performance is of paramount importance. In Iran, as in other countries, extensive research is being performed ...
Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance
Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database...
Fingerprinting and genetic diversity evaluation of rice cultivars using Inter Simple Sequence Repeat marker
Rice, as one of the most important agricultural crops, has putative potential for ensuring food security and addressing poverty in the world. In the present study, in order to provide basic information to improve rice through breeding programs, the Inter Simple Sequence Repeat (ISSR) marker was used for DNA fingerprinting and finding genetic relationships among 32 different cultivars. In this study...
The ed-tree: An Index for Large DNA Sequence Databases
The growing interest in genomic research has caused explosive growth in the size of DNA databases, making it increasingly challenging to perform searches on them. In this paper, we propose an index structure called the ed-tree for supporting fast and effective homology searches on DNA databases. The ed-tree is developed to enable probe-based homology search algorithms like Blastn which generat...